ggplot(data = births14,
       mapping = aes(x = mage, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Mother's Age",
       y = "Birth Weight of Baby (lbs)")

Week 7 – Confidence Intervals for the Slope
This week’s reading is a compilation of Chapter 8 of ModernDive (Kim et al. 2020) and Chapter 24 of Introduction to Modern Statistics (Çetinkaya-Rundel and Hardin 2023), with a smattering of my own ideas.
0.1 Sampling Review
In the last reading, we studied the concept of sampling variation. Using the example of estimating the proportion of red balls in a bowl, we started with a “tactile” exercise where a shovel was used to draw a sample of balls from the bowl. While we could have performed an exhaustive count of all the balls in the bowl, this would have been a tedious process. So instead, we used a shovel to extract a sample of balls and used the resulting proportion that were red as an estimate. Furthermore, we made sure to mix the bowl’s contents before every use of the shovel. Because of the randomness created by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the bowl’s balls that are red.
We then used R to mimic this “tactile” sampling process. Using our computer’s random number generator, we were able to quickly mimic the tactile sampling procedure a large number of times. Moreover, we were able to explore how different our results would be if we used different sized shovels, with 25, 50, and 100 slots. When we visualized the results of these three different shovel sizes, we saw that as the sample size increased, the variation in the estimates (\(\widehat{p}\)) decreased.
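The virtual sampling exercise can be sketched in base R. This is a minimal sketch, not the moderndive code used in the original reading; the bowl's composition (2,400 balls, 37.5% red) and the shovel sizes come from the reading, while the simulation details here are my own.

```r
set.seed(76)

# The bowl: 2400 balls, 37.5% red (900 red, 1500 white)
bowl <- c(rep("red", 900), rep("white", 1500))

# Take 1000 virtual samples with a shovel of size n and
# return the 1000 sample proportions red (the p-hats)
simulate_p_hats <- function(n, reps = 1000) {
  replicate(reps, mean(sample(bowl, size = n) == "red"))
}

# Standard error = standard deviation of each sampling distribution,
# one per shovel size
ses <- sapply(c(25, 50, 100), function(n) sd(simulate_p_hats(n)))
ses
```

Running this, the three standard errors shrink as the shovel size grows from 25 to 50 to 100 slots, mirroring the narrowing histograms from the reading.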
These visualizations of the repeated sampling from the bowl have a special name in Statistics – a sampling distribution. These distributions allow us to study how our estimates (\(\widehat{p}\)) varied from one sample to another; in other words, they let us study the effect of sampling variation. Once we had over 1,000 different estimates, we quantified the variation of these estimates using their standard deviation, which also has a special name in Statistics – the standard error. Visually, we saw the spread of the sampling distributions get narrower as the sample size increased, and the standard errors told the same story – they decreased as the sample size increased. This decrease in spread means that larger samples give more precise estimates, ones that vary less around the center.
We then tied these sampling concepts to the statistical terminology and mathematical notation related to sampling. Our study population was the large bowl with \(N\) = 2400 balls, while the population parameter, the unknown quantity of interest, was the population proportion \(p\) of the bowl’s balls that were red. Since performing a census would be expensive in terms of time and energy, we instead extracted a sample of size \(n\) = 50. The point estimate, also known as a sample statistic, used to estimate \(p\) was the sample proportion \(\widehat{p}\) of these 50 sampled balls that were red. Furthermore, since the sample was obtained at random, it can be considered as unbiased and representative of the population. Thus any results based on the sample could be generalized to the population. Therefore, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the bowl’s balls that are red. In other words, we used the sample to infer about the population.
However, we acknowledged that both the tactile and virtual sampling exercises are not what one would do in real life; this was merely an activity used to study the effects of sampling variation. In a real-life situation, we would not take 1,000s of samples of size \(n\), but rather take a single representative sample that’s as large as possible. Additionally, we knew that the true proportion of the bowl’s balls that were red was 37.5%. In a real-life situation, we will not know this value; after all, if we did, why would we take a sample to estimate it?
So how does one quantify the effects of sampling variation when you only have a single sample to work with? You cannot directly study the effects of sampling variation when you only have one sample. One common method to study this is bootstrap resampling, which will be the focus of the earlier sections of this reading.
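As a preview, the bootstrap idea can be sketched with a toy sample in base R; the numbers below are made up purely for illustration.

```r
set.seed(2024)

# A single observed sample: 21 red balls out of 50 (p-hat = 0.42)
our_sample <- c(rep("red", 21), rep("white", 29))

# Bootstrap: resample the SAME 50 observations WITH replacement,
# many times, recomputing p-hat each time
boot_p_hats <- replicate(1000, {
  resample <- sample(our_sample, size = 50, replace = TRUE)
  mean(resample == "red")
})

# The spread of the bootstrap p-hats approximates the standard error,
# even though we only ever collected one sample
sd(boot_p_hats)
```

The key move is resampling *with* replacement: each bootstrap sample has the same size as the original, but some observations appear multiple times and others not at all, which is what creates the variation.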
Furthermore, what if we would like not only a single estimate of the unknown population parameter, but also a range of highly plausible values? For example, when you read about political polls, they tell you the percent of all Californians who support a specific measure, but in addition to this estimate they provide the poll’s “margin of error”. This margin of error can be used to construct a range of plausible values for the true percentage of people who support a specific measure. This range of plausible values is what’s known as a confidence interval, which will be the focus of the later sections of this reading.
1 Baby Birth Weights
Medical researchers may be interested in the relationship between a baby’s birth weight and the age of the mother, so as to provide medical interventions for specific age groups if it is found that they are associated with lower birth weights.
Every year, the US releases to the public a large data set containing information on births recorded in the country. The births14 dataset is a random sample of 1,000 cases from one such dataset released in 2014.
1.1 Observed data
Figure 1 visualizes the relationship between mage and weight for this sample of 1,000 birth records.
Table 1 displays the estimated regression coefficients for modeling the relationship between mage and weight for this sample of 1,000 birth records.
births_lm <- lm(weight ~ mage, data = births14)
get_regression_table(births_lm)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 6.793 | 0.208 | 32.651 | 0.000 | 6.385 | 7.201 |
| mage | 0.014 | 0.007 | 1.987 | 0.047 | 0.000 | 0.028 |
Based on these coefficients, the estimated regression equation is:
\[ \widehat{\text{birth weight}} = 6.793 + 0.014 \times \text{mother's age}\]
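A side note on the lower_ci and upper_ci columns in Table 1: they follow the usual estimate \(\pm\) multiplier \(\times\) standard error recipe. A quick sketch with the rounded values from the table and an approximate multiplier of 1.96 (get_regression_table uses the exact t multiplier, so this is only an approximation):

```r
b1 <- 0.014   # slope estimate for mage, from the table
se <- 0.007   # its standard error, from the table

# Approximate 95% interval: estimate +/- 1.96 * SE
ci <- b1 + c(-1, 1) * 1.96 * se
round(ci, 3)
# matches the table's (0.000, 0.028) after rounding
```

We will have much more to say about where this recipe comes from in the later sections on confidence intervals.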
We will let \(\beta_1\) represent the slope of the relationship between baby’s birth weight and mother’s age for every baby born in the US in 2014. We will estimate \(\beta_1\) using the births14 dataset, labeling the estimate \(b_1\) (just as we did in Week 4).
A parameter is the value of the statistic of interest for the entire population.
We typically estimate the parameter using a “point estimate” from a sample of data. The point estimate is also referred to as the statistic.
1.2 Variability of the statistic
This sample of 1,000 births is only one of possibly tens of thousands of possible samples that could have been taken from the large dataset released in 2014. So we might wonder how different our regression equation would be if we had a different sample. There is no reason to believe that \(\beta_1\) is exactly 0.014, but there is also no reason to believe that \(\beta_1\) is particularly far away from \(b_1 =\) 0.014.
Just this week you read about how estimates, such as \(b_1\), are prone to sampling variation – the variability from sample to sample. For example, if we took a different sample of 1,000 births, would we obtain a slope of exactly 0.014? No, that seems fairly unlikely. We might obtain a slope of 0.017 or 0.011, or even 0.03!
When we studied the effects of sampling variation, we took many samples, something that was easily done with a shovel and a bowl of red and white balls. In this case, however, how would we obtain another sample? Well, we would need to go to the source of the data—the large public dataset released in 2014—and take another random sample of 1,000 observations. But maybe we don’t have access to that original dataset of 2014 births; how, then, could we study the effects of sampling variation using our single sample? We will do so using a technique known as bootstrap resampling with replacement, which we now illustrate.
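The row-resampling idea can be sketched as follows. Since births14 may not be loaded in your session, this sketch uses simulated data with a similar structure; the variable names mage and weight match the reading, but the data themselves are made up.

```r
set.seed(2014)

# Stand-in for births14: 1000 simulated (mother's age, birth weight) pairs
fake_births <- data.frame(mage = runif(1000, min = 15, max = 45))
fake_births$weight <- 6.8 + 0.014 * fake_births$mage + rnorm(1000, sd = 1.3)

# One bootstrap replicate: resample 1000 ROWS with replacement,
# refit the regression, and keep the slope
boot_slopes <- replicate(1000, {
  rows <- sample(nrow(fake_births), replace = TRUE)
  coef(lm(weight ~ mage, data = fake_births[rows, ]))["mage"]
})

# The standard deviation of the bootstrap slopes estimates
# the standard error of b1
sd(boot_slopes)
```

Note that we resample entire rows, keeping each mother's age paired with her baby's weight; resampling the two columns independently would destroy the very relationship we are trying to estimate.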
1.3 Resampling once
Step 1: Print out 1,000 identically sized slips of paper (or post-it notes), one for each of the 1,000 babies in our sample. On each piece of paper, write the mother’s age and the birth weight of the baby. Figure 2 displays six such slips.
Step 2: Put the 1,000 slips of paper into a hat as seen in Figure 3.
Step 3: Mix the hat’s contents and draw one slip of paper at random, as seen in Figure 4. Record the mother’s age and baby’s birth weight, as printed on the paper.
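Steps 1 through 3 above amount to drawing one row of the data at random. A minimal sketch, again using a made-up stand-in for births14:

```r
set.seed(1)

# Stand-in for the 1,000 slips of paper: one row per baby,
# each recording the mother's age and the baby's birth weight
slips <- data.frame(mage   = sample(15:45, size = 1000, replace = TRUE),
                    weight = round(rnorm(1000, mean = 7.2, sd = 1.3), 2))

# Step 3: mix the hat and draw ONE slip at random,
# recording the age and weight written on it
drawn <- slips[sample(nrow(slips), size = 1), ]
drawn
```

Each draw returns a single (mage, weight) pair, just like pulling one slip from the hat.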